String dtype: use 'str' string alias and representation for NaN-variant of the dtype #59388

Merged

jorisvandenbossche
Member

As discussed for PDEP-14, the idea is to keep "string" for the NA-variant of the string dtype (for backwards compatibility) and, to have a distinguishing string representation and alias, to use "str" for the default NaN-variant.

This PR updates StringDtype to make that distinction.
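
To make the intended mapping concrete, here is a minimal toy sketch of the alias/name distinction described above. `ToyStringDtype` and `ALIASES` are hypothetical names used only for illustration; this is not the pandas implementation, and the `"NA"`/`"NaN"` strings merely stand in for the two missing-value sentinels.

```python
# Toy model for illustration only; this is NOT pandas' StringDtype.
class ToyStringDtype:
    """Hypothetical stand-in with two variants, keyed by missing-value sentinel."""

    def __init__(self, na_value: str) -> None:
        if na_value not in ("NA", "NaN"):
            raise ValueError(f"unknown na_value: {na_value!r}")
        self.na_value = na_value

    @property
    def name(self) -> str:
        # "string" stays with the NA-variant; "str" marks the NaN-variant.
        return "string" if self.na_value == "NA" else "str"


# Hypothetical alias table: each alias resolves to exactly one variant.
ALIASES = {
    "string": ToyStringDtype("NA"),
    "str": ToyStringDtype("NaN"),
}

for alias, dtype in ALIASES.items():
    print(f"{alias!r} -> na_value={dtype.na_value}, name={dtype.name!r}")
```

In this sketch the alias and the name round-trip: looking up an alias gives a variant whose `name` is that same alias.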

One point of discussion that came up while implementing this: what to do with dtype == "string"?

There are quite a few places in our internal code where we use this to check for a StringDtype in general (instead of something like isinstance(dtype, StringDtype)). But should that return True exclusively for the NA-variants of the StringDtype (to match the string alias/repr)?
For now I kept a special case that lets this return True. Alternatively, I could update all the places in our own code that currently use this pattern, and change dtype == "string" to return False for the NaN-variant (dtype == "str" returns True for the NaN-variant, and False for the NA-variant).

xref #54792

@jorisvandenbossche jorisvandenbossche added the Strings String extension data type and string data label Aug 2, 2024
Comment on lines +168 to +169
# TODO add more informative repr
return self.name
Member Author

To discuss in #59342

Comment on lines +175 to +176
# TODO should dtype == "string" work for the NaN variant?
if other == "string" or other == self.name: # noqa: PLR1714
Member Author

@jorisvandenbossche jorisvandenbossche Aug 2, 2024

This other == "string" check is added as a special case so that, for now, dtype == "string" keeps working for the NaN-variant as well. See the top post of this PR for some context.
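
A rough sketch of what this special case means for comparisons, using a toy class rather than the real dtype. `ToyStringDtype` is a hypothetical stand-in that models only `name` and `__eq__`; the string branch mirrors the condition in the hunk above, and everything else is an assumption made for illustration.

```python
# Toy model for illustration only; this is NOT pandas' StringDtype.
class ToyStringDtype:
    """Hypothetical stand-in; only `name` and `__eq__` are modelled."""

    def __init__(self, name: str) -> None:
        self.name = name  # "string" (NA-variant) or "str" (NaN-variant)

    def __eq__(self, other: object) -> bool:
        if isinstance(other, str):
            # Special case: "string" matches every variant, for now.
            return other == "string" or other == self.name
        if isinstance(other, ToyStringDtype):
            return self.name == other.name
        return NotImplemented

    def __hash__(self) -> int:
        return hash(self.name)


na_variant = ToyStringDtype("string")
nan_variant = ToyStringDtype("str")

print(nan_variant == "str")     # True
print(nan_variant == "string")  # True, only because of the special case
print(na_variant == "str")      # False
print(na_variant == "string")   # True
```

Dropping the `other == "string"` clause in this toy would make `nan_variant == "string"` return False, which is the alternative raised in the top post.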

@jorisvandenbossche jorisvandenbossche marked this pull request as ready for review August 6, 2024 08:13
@@ -224,7 +224,7 @@ def test_apply_categorical(by_row, using_infer_string):
     result = ser.apply(lambda x: "A")
     exp = Series(["A"] * 7, name="XX", index=list("abcdefg"))
     tm.assert_series_equal(result, exp)
-    assert result.dtype == object if not using_infer_string else "string[pyarrow_numpy]"
+    assert result.dtype == object if not using_infer_string else "str"
Member

Maybe we should use this as an opportunity to move away from string comparisons and use the pd.StrDtype instead?

Member Author

I changed it in some places to an explicit StringDtype(..) creation where we were testing a series of dtypes, but for now kept using the string alias for convenience in other places.
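
A side note on the form of the assertion in the hunk above, as an observation about Python's grammar rather than something raised in this review: a conditional expression binds more loosely than `==`, so `assert a == b if cond else "str"` is evaluated as `assert ((a == b) if cond else "str")`. The sketch below uses plain Python values and hypothetical helper names (`check_conditional_form`, `check_explicit_form`) to show the difference from selecting the expected value first.

```python
def check_conditional_form(actual, using_infer_string: bool) -> bool:
    # Same shape as the assertion in the test. Python groups it as
    # (actual == object) if not using_infer_string else "str",
    # so the else branch is a bare, truthy string and nothing is compared.
    return bool(actual == object if not using_infer_string else "str")


def check_explicit_form(actual, using_infer_string: bool) -> bool:
    # Pick the expected value first, then compare against it.
    expected = "str" if using_infer_string else object
    return actual == expected


# "int64" stands in for an unexpected dtype; plain strings are used here
# instead of real dtype objects to keep the sketch dependency-free.
print(check_conditional_form("int64", using_infer_string=True))  # True
print(check_explicit_form("int64", using_infer_string=True))     # False
print(check_explicit_form("str", using_infer_string=True))       # True
```

Either an explicit `expected` variable or parentheses around the conditional expression makes the comparison apply in both branches.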

Member

@phofl phofl left a comment

lgtm, feel free to merge after CI is green

@jorisvandenbossche
Member Author

One point of discussion that came up while implementing this: what to do with dtype == "string"?

In a chat with Patrick/Brock/Will, we leaned towards keeping this returning True for the NaN-variants as well for now (as the current PR does).

@jorisvandenbossche jorisvandenbossche merged commit a2fb11e into pandas-dev:main Aug 8, 2024
45 checks passed
@jorisvandenbossche jorisvandenbossche deleted the string-dtype-alias branch August 8, 2024 12:35
WillAyd pushed a commit that referenced this pull request Aug 13, 2024
WillAyd pushed a commit to WillAyd/pandas that referenced this pull request Aug 14, 2024
WillAyd pushed a commit to WillAyd/pandas that referenced this pull request Aug 15, 2024
WillAyd pushed a commit to WillAyd/pandas that referenced this pull request Aug 15, 2024
WillAyd pushed a commit to WillAyd/pandas that referenced this pull request Aug 15, 2024
@jorisvandenbossche jorisvandenbossche added this to the 2.3 milestone Aug 20, 2024
WillAyd pushed a commit to WillAyd/pandas that referenced this pull request Aug 21, 2024
WillAyd pushed a commit to WillAyd/pandas that referenced this pull request Aug 22, 2024
WillAyd pushed a commit to WillAyd/pandas that referenced this pull request Aug 22, 2024
WillAyd pushed a commit to WillAyd/pandas that referenced this pull request Aug 22, 2024
WillAyd added a commit to WillAyd/pandas that referenced this pull request Aug 27, 2024
WillAyd added a commit to WillAyd/pandas that referenced this pull request Sep 20, 2024
jorisvandenbossche pushed a commit to WillAyd/pandas that referenced this pull request Oct 2, 2024
jorisvandenbossche pushed a commit to WillAyd/pandas that referenced this pull request Oct 2, 2024
jorisvandenbossche pushed a commit to WillAyd/pandas that referenced this pull request Oct 3, 2024
jorisvandenbossche pushed a commit to WillAyd/pandas that referenced this pull request Oct 7, 2024
Labels
backported Strings String extension data type and string data